dim(ordered_df)
## [1] 117601 23
Here, we can see the revenue increases with the time
###Moving ahead with the type of payments chosen by the customers
From above interactive pie chart we can see that credit card was most preferred payment type followed by the boleto which is brazilian payment type regulated by brazilian federation of Banks and is used for ecommerce and utility payments. Sao Paulo has the most number of shoppers for Olist with almost 40K customers. It is followed by the Rio De Janeiro and Minas Gerais.
## Warning: package 'plotly' was built under R version 4.0.3
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
The overall trend of number of increasing customers is positive with month of November in 2017 saw the maximum number of customers.
The above plot has a positive trend with some sort of seasonality
## STL decomposition ### STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”, while Loess is a method for estimating nonlinear relationships.
With ACF we can see that there is an exponential decay. There is also correlation in the residuals. This combination of ACF and PACF suggests that the underlying time series follows an autoregressive model.
rev_olist %>%
features(revenue, unitroot_kpss)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.708 0.0128
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).
After differencing twice, we can see that our data is now almost a white noise
fit_manual <- rev_olist %>% model(arima = ARIMA(revenue ~ pdq(1, 2, 0)))
## Warning: Having 3 or more differencing operations is not recommended. Please
## consider reducing the total number of differences.
report(fit_manual)
## Series: revenue
## Model: ARIMA(1,2,0)(0,1,0)[12]
##
## Coefficients:
## ar1
## -0.5345
## s.e. 0.4145
##
## sigma^2 estimated as 2.645e+10: log likelihood=-96.05
## AIC=196.1 AICc=199.1 BIC=195.99
AICc for the manual p,d,and q is 199.1
auto_cv <- rev_olist %>%
model(ARIMA(revenue ~ pdq(d=2), stepwise = FALSE, approximation = FALSE)) %>%
report()
## Warning: Having 3 or more differencing operations is not recommended. Please
## consider reducing the total number of differences.
## Warning in sqrt(diag(best$var.coef)): NaNs produced
## Series: revenue
## Model: ARIMA(3,2,3)(0,1,0)[12]
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3
## -0.4193 -0.5638 0.3617 -0.1872 -0.8803 0.0981
## s.e. 0.2519 NaN 0.2263 NaN NaN NaN
##
## sigma^2 estimated as 1.978e+10: log likelihood=-91.53
## AIC=197.05 AICc=85.05 BIC=196.68
rev_olist %>%
model(ARIMA(revenue ~ pdq(d=2))) %>%
report()
## Warning: Having 3 or more differencing operations is not recommended. Please
## consider reducing the total number of differences.
## Series: revenue
## Model: ARIMA(0,2,0)(0,1,0)[12]
##
## sigma^2 estimated as 2.841e+10: log likelihood=-96.69
## AIC=195.38 AICc=196.18 BIC=195.33
There are no lags outside the confidence interval and the residuals are normally distributed. This mean the data values have no correlation left.
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(revenue ~ pdq(d = 2), stepwise = FALSE, approximation~ 10.7 0.151
P-value is much greater than 0.05 hence, we can conclude that data values are independent.
auto_cv %>%
forecast(h = 5)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A fable: 5 x 4 [1M]
## # Key: .model [1]
## .model Month revenue .mean
## <chr> <mth> <dist> <dbl>
## 1 ARIMA(revenue ~ pdq(d = 2), stepwise = F~ 2018 Oct N(1368056, 2.4e+10) 1.37e6
## 2 ARIMA(revenue ~ pdq(d = 2), stepwise = F~ 2018 Nov N(1677809, 7.1e+10) 1.68e6
## 3 ARIMA(revenue ~ pdq(d = 2), stepwise = F~ 2018 Dec N(1064850, 8.3e+10) 1.06e6
## 4 ARIMA(revenue ~ pdq(d = 2), stepwise = F~ 2019 Jan N(1443186, 1.3e+11) 1.44e6
## 5 ARIMA(revenue ~ pdq(d = 2), stepwise = F~ 2019 Feb N(1120816, 2e+11) 1.12e6
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 1 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(revenue ~ pdq(d = 2~ Trai~ 2011. 30692. 17529. 0.0751 1.28 0.0229 0.201
## Series: revenue
## Model: ETS(A,A,N)
## Smoothing parameters:
## alpha = 0.2210484
## beta = 0.221048
##
## Initial states:
## l b
## 192170.7 87427.88
##
## sigma^2: 35525306131
##
## AIC AICc BIC
## 579.6612 583.6612 584.8838
## Series: revenue
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.844217
## gamma = 0.0001000081
##
## Initial states:
## l s1 s2 s3 s4 s5 s6 s7
## 1044208 -56493.94 547972.6 48726.19 106859.3 24568.31 -44790.34 -117876.2
## s8 s9 s10 s11 s12
## 69326.98 -64761.56 -1310.199 -120684 -391537.2
##
## sigma^2: 134041930898
##
## AIC AICc BIC
## 608.9139 704.9139 624.5817
components(ets_fit) %>% autoplot()
## Warning: Removed 1 row(s) containing missing values (geom_path).
### Comparing the errors for all the 3 methods:
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ses Training 50764. 219916. 164825. -4.60 25.8 0.215 -0.208
## 2 holt Training -35123. 169584. 122536. -4.98 13.5 0.160 -0.227
## 3 damped Training -9431. 166402. 118018. -1.81 13.4 0.154 -0.197
Damped method gives the least error.
## ETS(M,A,N)
##
## Call:
## ets(y = mrev_olist_ts)
##
## Smoothing parameters:
## alpha = 0.179
## beta = 0.1789
##
## Initial states:
## l = 174700.9185
## b = 87428.8417
##
## sigma: 0.1858
##
## AIC AICc BIC
## 576.9920 580.9920 582.2146
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -46873.1 171922.8 126463.9 -5.447873 13.40152 0.1651765 -0.1330138
ets() gives ETS(M,A,N) as the model with better AICc. M is multiplicative error, A is additive trend and N is with none seasonality. It is a Multiplicative Holt-Winter’s method with additive errors
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Oct 2018 1210681.9 922439.3 1498924 769852.9 1651511
## Nov 2018 1121966.6 834976.7 1408957 683053.3 1560880
## Dec 2018 1033251.4 725369.8 1341133 562387.1 1504116
## Jan 2019 944536.1 592169.5 1296903 405637.9 1483434
## Feb 2019 855820.9 438986.1 1272656 218327.2 1493315
## ETS(M,A,N)
##
## Call:
## ets(y = mrev_olist_ts)
##
## Smoothing parameters:
## alpha = 0.179
## beta = 0.1789
##
## Initial states:
## l = 174700.9185
## b = 87428.8417
##
## sigma: 0.1858
##
## AIC AICc BIC
## 576.9920 580.9920 582.2146
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -46873.1 171922.8 126463.9 -5.447873 13.40152 0.1651765 -0.1330138
We can see that damped model and ETS(M,A,N) are almost similar with damped method has very little error difference from the latter.
##
## Ljung-Box test
##
## data: Residuals from ETS(M,A,N)
## Q* = 7.1444, df = 3, p-value = 0.06743
##
## Model df: 4. Total lags used: 7
## [1] 393166.4
## [1] 421376.7
Henc, we can infer that ETS model is better in terms of error than ARIMA.
For ARIMA, the data was made stationary with seasonal differencing at the lag 12 as the timeseries has monthly data values followed by the normal differencing. Automatic ARIMA model with stepwise and approximation equal to false returns the lowest AICc values.
I went ahead and tried ses, holt and damped method. Among all the three models Damped one had the least errors and was better of them. Next, I opted for ets() function to choose an optimal model based on the data values and it returned ETS(M,A,N) model with errors values almost similar to the damped method. The tsCV() for the ETS gives lesser error than ARIMA. Therefore, conncluding that ETS is performing better on the given time series data values.
RFM (recency, frequency, monetary) analysis is a behavior based technique used to segment customers by examining their transaction history such as
How recently a customer has purchased (RECENCY)? How often they purchase (FREQUENCY)? How much the customer spends (MONETARY_VALUE)?
Why is Customer segmentation Important? Ans: General mass marketing is often expensive, time consuming and sometimes not that responsive. Focussing on particulare segment of customers helps to prevent customer churn, build customer loyalty and also increasing the brand equity.
RFM_Scores[RFM_Scores$CustomerID == 2785,]
## CustomerID Recency_rank Frequency_rank Monetary_rank
## 3407 2785 3 1 3
## r_segment CustomerID
## 1 Active 24852
## 2 Frequent 20739
## 3 Idle 40173
## 4 Inactive 10785
We can see that there are maximum number of Idle customers.
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 5880050)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 5880050)
final_seg <- plotly::plot_ly(RFM_df, x = ~RECENCY, y = ~M_VALUE, z = ~FREQUENCY, color = ~km_clusters)
final_seg
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Above plot shows the clusters of the segmented customers based on their RFM values. (This may not get displayed in the knitted PDF. However, I have created a flexdashboard which contains all the outputs)
Cluster 1: Here the customers have moderate Recency, high Frequency and moderate Monetary value. This segment of the customers are one of the best and loyal customers. Rewarding them with a ways of exciting offers can result in improving the Recency and Monetary value further more.
Cluster 2: This segment has the customers with the lowest Recency, Frequency and Monetary Value. These are the customers are on the verge of churning. The churn can be avoided by offering one time free offers and preimum services trial periods.
Cluster 3: This cluster is similar to the cluster 1 but has a higher Recency value.
Cluster 4: Customers of this segments are one time buyers, which represents the largest segment Olist has. It has preety high value of monetary value. Tailored promotions like giving cashback and limited premium services to increase their frequency.
Review score of 5 was the highest which was given by 57000 customers.
We can infer that the MQL volume grew maximum in January of 2018
Conversion rate also increased with the MQL volume.
References:
Book Rob J Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. 3rd edition, 2020. OTexts. https://otexts.com/fpp3/.
https://towardsdatascience.com/exploring-highcharts-in-r-f754143efda7